SIMPITIKI: a Simplification corpus for Italian

Authors

  • Sara Tonelli
  • Alessio Palmero Aprosio
  • Francesca Saltori
Abstract

English. In this work, we analyse whether Wikipedia can be used to leverage simplification pairs instead of Simple Wikipedia, which has proved unreliable for assessing automatic simplification systems, and is available only in English. We focus on sentence pairs in which the target sentence is the outcome of a Wikipedia edit marked as ‘simplified’, and manually annotate simplification phenomena following an existing scheme proposed for previous simplification corpora in Italian. The outcome of this work is the SIMPITIKI corpus, which we make freely available, with pairs of sentences extracted from Wikipedia edits and annotated with simplification types. The resource also contains another corpus with roughly the same number of simplifications, which was manually created by simplifying documents in the administrative domain.

Italiano. In questo lavoro si analizza la possibilità di utilizzare Wikipedia per selezionare coppie di frasi semplificate. Si propone questa soluzione come un’alternativa a Simple Wikipedia, che si è dimostrata inattendibile per studiare la semplificazione automatica ed è disponibile solo in inglese. Ci concentriamo soltanto su coppie di frasi in cui la frase target è indicata come il frutto di una modifica in Wikipedia, indicata dagli editor come un caso di semplificazione. Tali coppie sono annotate manualmente secondo una classificazione delle tipologie di semplificazione già utilizzata in altri studi, e vengono rese liberamente disponibili nel corpus SIMPITIKI. La risorsa include anche un secondo corpus, contenente circa lo stesso numero di semplificazioni, realizzato intervenendo manualmente su alcuni documenti nel dominio amministrativo.

Similar resources

PaCCSS-IT: A Parallel Corpus of Complex-Simple Sentences for Automatic Text Simplification

In this paper we present PaCCSS–IT, a Parallel Corpus of Complex–Simple Sentences for ITalian. To build the resource we develop a new method for automatically acquiring a corpus of complex–simple paired sentences able to intercept structural transformations and particularly suitable for text simplification. The method requires a wide amount of texts that can be easily extracted from the web mak...

Design and Annotation of the First Italian Corpus for Text Simplification

In this paper, we present design and construction of the first Italian corpus for automatic and semi–automatic text simplification. In line with current approaches, we propose a new annotation scheme specifically conceived to identify the typology of changes an original sentence undergoes when it is manually simplified. Such a scheme has been applied to two aligned Italian corpora, containing o...

Assessing the Readability of Sentences: Which Corpora and Features?

The paper investigates the problem of sentence readability assessment, which is modelled as a classification task, with a specific view to text simplification. In particular, it addresses two open issues connected with it, i.e. the corpora to be used for training, and the identification of the most effective features to determine sentence readability. An existing readability assessment tool dev...

MUSST: A Multilingual Syntactic Simplification Tool

We describe MUSST, a multilingual syntactic simplification tool. The tool supports sentence simplifications for English, Italian and Spanish, and can be easily extended to other languages. Our implementation includes a set of general-purpose simplification rules, as well as a sentence selection module (to select sentences to be simplified) and a confidence model (to select only promising simpli...

Building a Monolingual Parallel Corpus for Text Simplification Using Sentence Similarity Based on Alignment between Word Embeddings

Methods for text simplification using the framework of statistical machine translation have been extensively studied in recent years. However, building the monolingual parallel corpus necessary for training the model requires costly human annotation. Monolingual parallel corpora for text simplification have therefore been built only for a limited number of languages, such as English and Portugu...

Publication date: 2016